Method and system for data extraction from images of semi-structured documents

ABSTRACT

The present invention is directed to a method of extracting data from fields in an image of a document. In one implementation, a text representation of the image of the document is obtained. A graph for storing features of the text fragments in the text representation of the image of the document and their links is constructed. A cascade classification for computing the features of the text fragments in the text representation of the image of the document and their links is run. Hypotheses about which fields in the image of the document the text fragments belong to are generated. Combinations of the hypotheses are generated. A combination of the hypotheses is selected. Data from the fields in the image of the document is then extracted based on the selected combination of the hypotheses.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application claims the benefit of priority under 35 USC 119 to Russian Patent Application No. 2015137956, filed Sep. 7, 2015; the disclosure of which is herein incorporated by reference in its entirety for all purposes.

FIELD OF THE INVENTION

The present invention relates to the field of data extraction from images of documents by means of Optical Character Recognition (OCR). More specifically, the present invention relates to utilizing a graph-based approach, a cascade classification approach, and a reduced alphabet approach to reduce field identification errors in processing textual information in the images of documents.

BACKGROUND OF THE INVENTION

For semi-structured documents such as checks, business cards, passports (of different countries), and credit cards, the location of the fields commonly varies from copy to copy. The known methods of identifying such fields in semi-structured documents are based on a “greedy algorithm”: all the fields are searched for in the text in a given order. If a fragment of the text is identified as a field, this fragment is not considered in subsequent search procedures. This approach places heavy demands on the quality of the first field search procedures and degrades the quality of the subsequent field search procedures. The first field search procedure decides whether a text fragment is a searched field of a semi-structured document without any information about the results of subsequent search procedures or about the document as a whole. As a result, the fields are often identified incorrectly.

To solve this problem we propose a method described herein using a graph structure. The graph enables us to save the results of all search procedures and to examine different combinations of the fields during further analysis of the search results. In addition, our method organizes the work of the field search procedures by cascade classification, which saves computational resources because only the required number of features is calculated. Our method also uses a reduced alphabet technique for generating dictionaries of keywords, which decreases the number of mistakes made in identifying fields by the search procedures that employ the dictionaries of keywords.

SUMMARY OF THE INVENTION

The present invention is directed to a method of extracting data from fields in an image of a document. In one implementation, a text representation of the image of the document is obtained. A graph for storing features of the text fragments in the text representation of the image of the document and their links is constructed. A cascade classification for computing the features of the text fragments in the text representation of the image of the document and their links is run. Hypotheses about which fields in the image of the document the text fragments belong to are generated. Combinations of the hypotheses are generated. A combination of the hypotheses is selected. Data from the fields in the image of the document is then extracted based on the selected combination of the hypotheses.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the method of automated extraction of meaningful information from a semi-structured document.

FIG. 2 illustrates the key steps of the entity extraction.

FIG. 3 illustrates the method of identifying the fields in a semi-structured document utilizing a graph and a cascade classification.

FIG. 4 illustrates the method of the cascade classification.

FIG. 5 illustrates an example of a graph constructed for a business card.

FIG. 6 provides a general architectural diagram for various types of computers and other processor-controlled devices.

FIG. 7 illustrates an example of representing a word in a dictionary using the technique of the reduced alphabet.

FIG. 8 illustrates an example of a word in the reduced alphabet that corresponds to two source words.

DETAILED DESCRIPTION OF THE INVENTION

In the context of the present invention the term “field” means information in a document that needs to be identified and extracted from the document.

Shown in FIG. 1 is a general illustration of the method of automated extraction of meaningful information from an image of a semi-structured document. A semi-structured document is a document containing a set of information fields (document elements intended for data extraction) whose design, number and layout may vary significantly in different versions of the document. An example of such a semi-structured document could be a sales receipt or a business card, although the present method and corresponding system should not be limited to sales receipts or business cards. At step 10 an image of a document is provided as input to the system. At step 12 the image is pre-processed at a stage of preliminary image processing, which serves to reduce noise and various image imperfections, as well as to adjust the image quality to make it suitable for further processing. Further processing at step 14 involves document analysis to determine the physical structure of the analyzed document, such as, for example, text blocks, presence or absence of a table, and so on. At step 16 optical character recognition of the document is performed (conversion of images of typewritten or printed text into machine-encoded text). Step 18 represents an entity (field) extraction step of the document processing method of the present invention.

FIG. 2 illustrates the key steps of the entity (field) extraction step 18 in FIG. 1. Step 20 in FIG. 2 illustrates constructing a graph as a structure for storing numerical characteristics of the fragments of the text and their links. Step 22 in FIG. 2 shows identification of the fields in the image of the document by utilizing the method of cascade classification. Fields in a document may be simple (without an internal structure, e.g. value of the goods) or composite (with an internal structure, e.g. an address field). Therefore in step 24 in FIG. 2 the components of fields are identified by using, for example, regular expressions, key words and other information that is of interest. Extraction of the desired identified information (data) is illustrated in step 26 of FIG. 2.

The structure used for storing various characteristics of a text representation of an image of a document is a graph. The text representation is a result of optical character recognition (OCR) of the document image. FIG. 5 illustrates a graph constructed for a business card. The text representation of the business card (500) is represented in the form of a graph. The nodes (502) of such a graph comprise text fragments of the document being analyzed. The nodes of the graph also comprise numerical characteristics of the text fragments. The nodes of the graph are compared with the fields in the document during document analysis. The nodes of the graph are connected by edges (504), which store numerical characteristics of logical connections (links) between the text fragments. In one embodiment each node of the graph is matched with one word in the text and the edges in the graph set the linear order of the words in the text. The linear order is a supposed reading order of text in a document that depends on the language of the document (for example, for English documents the reading order of text is from left to right; for Hebrew, from right to left). Each word has a corresponding node. During the analysis of the text that is described in detail below, two or more nodes may be merged together or one node may be split into two or more new nodes. The edges between the nodes may also be removed or added, and the numerical characteristics of the nodes (text fragments) and edges (links) may be changed.
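The one-word-per-node embodiment described above can be sketched as follows. This is a minimal illustration only: the class names `Node`, `Edge`, `TextGraph` and the helper `build_linear_graph` are hypothetical and do not appear in the specification, and the feature dictionaries are left empty for later procedures to fill in.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str                                      # the text fragment
    features: dict = field(default_factory=dict)   # numerical characteristics


@dataclass
class Edge:
    source: int                                    # index of the preceding node
    target: int                                    # index of the following node
    features: dict = field(default_factory=dict)   # characteristics of the link


@dataclass
class TextGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)


def build_linear_graph(words):
    """Create one node per word; edges follow the supposed reading order."""
    graph = TextGraph()
    for word in words:
        graph.nodes.append(Node(text=word))
    for i in range(len(words) - 1):
        graph.edges.append(Edge(source=i, target=i + 1))
    return graph


graph = build_linear_graph(["John", "Smith", "Tel", "555-0100"])
```

Storing features in per-node and per-edge dictionaries is one possible design choice; it lets later procedures add features without changing the graph structure.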

FIG. 3 illustrates the method of extracting data from a semi-structured document utilizing a graph and a cascade classification. In an illustrative example, a system for extracting the data from a semi-structured document receives a text representation of a document image (302). The text representation is a result of optical character recognition (OCR) of the document image. At step 304 a graph is constructed to represent the recognized text of the document. Each node in the graph is matched with a word, a word combination, or a fragment of the text representation (such as, for example, fragment “www” in the “http://” address, or an ending of a last name), meaning that the number of the nodes in the graph is equal to or greater than the number of words in the text representation of the image of the document, and that each node is connected to all other nodes by edges.

At step 306 a cascade classification of identified text fragments of the document is performed. The cascade classification is a method for collecting information about features of text fragments, including computation of the text fragments' features, and about links between the text fragments. The cascade classification is an iterative process. The process of the cascade classification runs until the collected information about the text fragments and the links between them is adequate to generate hypotheses about the text fragments belonging to a particular document field. Each iteration in the cascade classification is a run of a particular procedure.

A procedure is a computer program function that processes a graph, calculates certain features for nodes and edges of the graph, and generates new features. Feature computation and feature generation are performed based on the previous data (calculated by the previous procedures). When a particular procedure is launched, the cascade principle is applied: the nodes corresponding to the text fragments that do not need to be processed with this particular procedure are disregarded.

For example, if during the processing of a text representation of a business card's image the first procedure in the cascade classification includes a font size determination, in which text fragments (nodes) within the business card are divided into text fragments of type “small” and text fragments of type “large”, and the following procedure is, for example, searching for names in the document, then the nodes having value “small” of feature “size” will be cut off, and the second procedure will calculate values only for the nodes with the font size feature determined to be “large”.

In another example, if during previous procedures a text fragment was identified as belonging to a class of numbers, and the next procedure is a search for keywords, this next procedure will not be applied to the “number” text fragment. Keywords are words which are associated with certain fields and which may be used to detect fields.
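The cascade principle in the two examples above can be sketched as a procedure that skips nodes already ruled out by earlier procedures. This is an illustrative sketch under assumptions: nodes are plain dictionaries, and the function name `search_names` and the feature keys `size`, `class` and `is_name` are hypothetical.

```python
def search_names(nodes, known_names):
    """Hypothetical name-search procedure that applies the cascade principle."""
    for node in nodes:
        # Disregard nodes that earlier procedures marked as small text or numbers.
        if node.get("size") == "small" or node.get("class") == "number":
            continue
        node["is_name"] = node["text"].lower() in known_names
    return nodes


nodes = [
    {"text": "John", "size": "large"},
    {"text": "www.example.com", "size": "small"},
    {"text": "428", "size": "large", "class": "number"},
]
search_names(nodes, known_names={"john", "sheila"})
```

Only the first node receives the `is_name` feature; the other two are never looked up in the set of names.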

FIG. 4 illustrates a block diagram of the cascade classification of the text fragments. As an input the system receives a graph (402). The text fragments are nodes of the graph. The nodes of the graph are joined by edges (logical links between the text fragments) which set a linear order of recognized text within the document. There can also be other edges, alternative to linear order links.

At step 404 we select a procedure for field searching among the plurality of various procedures based on resource consumption. This means that the most resource-saving procedures are selected first, and the most resource-intensive procedures are selected later. The resource-saving procedures are, for example, searching for text fragments consisting of letters, searching for text fragments that include numbers, or identifying text fragments by font size. An example of a resource-intensive procedure is searching for fields using a large electronic dictionary.

At step 406 the procedure selected at step 404 is launched. The procedure analyzes all text fragments (or some portion of the fragments) and generates a value for a feature or values for multiple features (numerical characteristic(s)). More details about the procedures and the features they identify will be provided below.

At step 408, the graph is modified based on the values of the features identified by the procedure for all or some of the text fragments. Namely, the feature values in the nodes (text fragments) and in the edges of the graph are updated. Any changes of the graph structure may also be performed at this step. For example, it may become necessary to split a text fragment containing both letters and numbers into two nodes to separate the letters from the numbers. This is done by forming a new node, adding an edge from the old node to the new node, and moving the edges that originated from the old node to the new node.
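The node split described above (new node, edge from the old node to the new node, outgoing edges moved to the new node) can be sketched as follows. This is a simplified illustration, assuming nodes are stored as a list of strings and edges as `(source, target)` index pairs; the function name `split_letters_digits` is hypothetical, and only fragments of the form letters followed by digits are handled.

```python
import re


def split_letters_digits(nodes, edges, index):
    """Split nodes[index] (letters followed by digits) into two nodes."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", nodes[index])
    if match is None:
        return nodes, edges                      # nothing to split
    letters, digits = match.groups()
    new_index = len(nodes)                       # the new node goes at the end
    new_nodes = nodes[:index] + [letters] + nodes[index + 1:] + [digits]
    # Edges that originated from the old node now originate from the new node.
    new_edges = [(new_index if s == index else s, t) for s, t in edges]
    # Link the old node (letters) to the new node (digits).
    new_edges.append((index, new_index))
    return new_nodes, new_edges


nodes, edges = split_letters_digits(
    ["Office", "Suite200", "Boston"], [(0, 1), (1, 2)], index=1
)
# nodes -> ["Office", "Suite", "Boston", "200"]
# edges -> [(0, 1), (3, 2), (1, 3)]
```

The reading order is preserved: "Office" leads to "Suite", "Suite" to "200", and "200" to "Boston".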

At step 410 we determine whether the received information (computed features) for the text fragments and their links is sufficient for generating one or more hypotheses about the types of fields associated with these text fragments.

If the information about the text fragments and the links is not sufficient, then the method moves to the step of selecting the next search procedure (404). The next procedure is selected based on the previously computed features. In one embodiment each subsequent procedure is more resource-intensive than the previous one. An example of a more resource-intensive procedure may be a procedure of searching for a text fragment satisfying a certain regular expression. A regular expression is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings.

If the information about the text fragments and their links is sufficient, then the cascade classification of the text fragments is completed.
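The loop of steps 404 to 410 can be sketched as below, for the embodiment in which each subsequent procedure is more resource-intensive than the previous one. The sketch makes several assumptions that are not in the specification: each procedure carries a hand-assigned numeric cost, nodes are plain dictionaries, and "sufficient" is simplified to "every node has a class". All function names are hypothetical.

```python
def mark_digits(nodes):
    """Cheap procedure: classify fragments consisting only of digits."""
    for node in nodes:
        if node["text"].isdigit():
            node["class"] = "number"


def lookup_keywords(nodes):
    """Costlier procedure: look up still-unclassified fragments in a keyword set."""
    for node in nodes:
        if "class" not in node and node["text"].lower() in {"tel", "fax"}:
            node["class"] = "keyword"


def is_sufficient(nodes):
    """Simplified sufficiency check (step 410): every node has a class."""
    return all("class" in node for node in nodes)


def run_cascade(nodes, procedures):
    """Run procedures from cheapest to costliest until the data is sufficient."""
    executed = []
    for cost, name, procedure in sorted(procedures, key=lambda p: p[0]):
        procedure(nodes)              # steps 406 and 408: compute and store features
        executed.append(name)
        if is_sufficient(nodes):      # step 410
            break
    return executed


PROCEDURES = [(10, "keywords", lookup_keywords), (1, "digits", mark_digits)]

only_numbers = [{"text": "428"}, {"text": "555"}]
mixed = [{"text": "Tel"}, {"text": "428"}]
ran_for_numbers = run_cascade(only_numbers, PROCEDURES)   # ["digits"]
ran_for_mixed = run_cascade(mixed, PROCEDURES)            # ["digits", "keywords"]
```

When the cheap procedure alone classifies every fragment, the dictionary lookup is never launched, which is the resource saving the cascade is meant to provide.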

Returning to FIG. 3, at step 308 the hypotheses about the text fragments belonging to certain fields of the document and their combinations are generated. The hypothesis generation is based on the information about the text fragments and their links calculated during the cascade classification 306. When hypotheses are generated, they are assigned certain confidence levels. The confidence level of a hypothesis depends on the values of features of the text fragments and their links. In one embodiment the confidence of the hypotheses is measured as a percentage from 0 to 100%. Multiple hypotheses with different or equal confidence levels may be generated for a single text fragment. As a result, several combinations of the hypotheses about the fields of the document may be generated. At step 310 the quality of different combinations of hypotheses about the fields of the document is computed. The computed quality of the combinations of the hypotheses may be used for comparing the combinations of the hypotheses with each other. A number of metrics measuring the quality of a combination of hypotheses are taken into consideration. The first metric is a cumulative metric of the confidence levels for a combination of the hypotheses. The cumulative metric of the confidence levels for all the hypotheses within the combination of the hypotheses is computed as a sum of the confidence levels computed for each text fragment with respect to the hypothesis. The higher the cumulative metric of the confidence levels of all the hypotheses within the combination, the better the quality of the combination.

The second metric is a cumulative metric of fines of compatibility of different hypotheses within one combination. The cumulative metric of the fines for a combination of hypotheses is computed as a sum of the fines. In this metric the combinations of the hypotheses are fined based on certain rules describing specific fields, their characteristics, their arrangement relative to each other (the geometry of the fields) and the possible structure of a document. The rules are usually created for a particular type of document. For example, in a business card the name field cannot be absent, there cannot be two name fields or several address fields located in different parts of the document, etc. The smaller the cumulative metric of compatibility fines of a combination of hypotheses, the better the quality of this combination. The computed quality of a combination of hypotheses is at least in part based on the cumulative metric of confidences and the cumulative metric of fines.
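The two metrics can be illustrated with a short sketch. The specification states only that the quality is based at least in part on both metrics; combining them as "sum of confidences minus sum of fines" is an assumption made here for illustration, as are the fine values and the function names.

```python
def fine_missing_name(hypotheses):
    """Business card rule: the name field cannot be absent (fine value assumed)."""
    return 0 if any(h["field"] == "name" for h in hypotheses) else 50


def fine_duplicate_name(hypotheses):
    """Business card rule: there cannot be two name fields (fine value assumed)."""
    count = sum(1 for h in hypotheses if h["field"] == "name")
    return 30 * (count - 1) if count > 1 else 0


def combination_quality(hypotheses, rules):
    """Assumed combination: cumulative confidence minus cumulative fines."""
    confidence_total = sum(h["confidence"] for h in hypotheses)
    fine_total = sum(rule(hypotheses) for rule in rules)
    return confidence_total - fine_total


RULES = [fine_missing_name, fine_duplicate_name]

combination_a = [
    {"text": "John Smith", "field": "name", "confidence": 90},
    {"text": "555-0100", "field": "phone", "confidence": 80},
]
combination_b = [
    {"text": "John Smith", "field": "name", "confidence": 90},
    {"text": "555-0100", "field": "name", "confidence": 85},
]

quality_a = combination_quality(combination_a, RULES)   # 170 - 0
quality_b = combination_quality(combination_b, RULES)   # 175 - 30
best = max([combination_a, combination_b],
           key=lambda c: combination_quality(c, RULES))
```

Although the second combination has the higher cumulative confidence, the fine for two name fields makes the first combination the better one.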

In one embodiment at step 312 we determine whether a computed quality of a combination of hypotheses is above a predefined threshold value or whether there is a combination of hypotheses for the entire document of sufficient quality. If such a combination is not found, then the method returns to step 306 and the process of obtaining additional characteristics of the text fragments and their links utilizing the cascade classification resumes. In this case, in one embodiment more complicated (resource-intensive) procedures are used, for example, procedures that utilize a larger glossary. If a suitable combination of hypotheses is found, the analysis of the text of the document is considered completed, i.e., the fields in the image of the document are identified. In another embodiment at step 312 the combination of hypotheses with the highest computed quality is selected.

In the method of comparing the combinations of the hypotheses described above, a quality is computed for each combination of hypotheses, i.e. each combination is assigned a numerical quality indicator that characterizes the quality of the whole combination of hypotheses. The best combination of hypotheses is chosen by comparing these quality indicators of combinations with each other and with a predefined threshold. Such a comparison may not be very accurate. There is an alternative approach to comparing the combinations of hypotheses. In this approach, a set of features, or feature vector, of one combination of hypotheses is compared with a set of features, or feature vector, of another combination of hypotheses. The feature vector of a combination of hypotheses may include a list of fields and their links identified by the combination of the hypotheses. Selecting the best combination is based on a set of rules stored in the system for each document type. For example, as a result of the procedures for the business card document type, two combinations of hypotheses are generated. The feature vector of the first combination of hypotheses may include the following: the document contains two logically linked fields, wherein the first field is Name and the second field is Surname. The feature vector of the second combination of hypotheses may include the following: the document contains two logically linked fields, wherein the first field is Name and the second field is Position (Job). A set of rules for business cards may include a rule according to which the best combination of hypotheses for business cards is a combination that has both Name and Surname fields, with these fields located next to each other (logically linked). In this example, the first combination of hypotheses wins, because the feature vector of this combination is more consistent with the rules. This alternative approach to comparing the combinations of hypotheses is more accurate and allows more nuances to be taken into account.
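The business card example above can be sketched as a rule check over feature vectors. This is only an illustration: representing a feature vector as a dictionary with a set of linked field pairs, and ranking combinations by the number of satisfied rules, are assumptions, and the names below are hypothetical.

```python
def has_linked_name_and_surname(feature_vector):
    """Business card rule: Name and Surname fields exist and are logically linked."""
    return ("Name", "Surname") in feature_vector["linked_fields"]


RULES = [has_linked_name_and_surname]


def rules_satisfied(feature_vector, rules=RULES):
    """Count how many of the stored rules the feature vector is consistent with."""
    return sum(1 for rule in rules if rule(feature_vector))


first = {"linked_fields": {("Name", "Surname")}}
second = {"linked_fields": {("Name", "Position")}}

best = max([first, second], key=rules_satisfied)
```

Here `best` is the first combination, matching the outcome described in the text.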

Running procedures for ascertaining new or additional features should be performed in the order necessary for providing high quality classification. The order of running classifiers is determined either manually or automatically.

As an example of the method of cascade classification as used in the present invention, the need to recognize a plurality of sales receipts from several different stores is considered. The fields in sales receipts from the same store usually have the same geometrical structure: the locations of the fields, the font and other features do not vary on sales receipts from the same vendor. Such features usually do vary on sales receipts from different stores or vendors. In order to classify sales receipts from different stores, it is first necessary to run the vendor search procedure, utilizing a dictionary with the vendors' names, i.e. to add the vendor identification to the cascade classifier. If a vendor is identified, the rest of the fields are identified utilizing the corresponding template connected with the vendor. If, for example, the vendor's cash registers have not changed since the time of the template creation, the use of a current template will significantly speed up the process of extracting data.

In order to ascertain new features, the method of the present invention utilizes different heuristics for more precisely selecting several possible locations of a certain type of field in the receipt. For example, a field corresponding to the name of the vendor is often located next to the field corresponding to the address. Therefore, if the location corresponding to the vendor field is known, then it is likely that the address text is located near the name of the vendor, so that the corresponding feature can be identified and introduced. Thus, if the vendor name field is found, then we can find the address field by utilizing the feature <<Text fragment locates near Vendor Name>>.

For searching the vendor's name field in sales receipts we can use the <<Text fragment locates in the first line>> feature.

One of the examples of using the heuristics in the method of the present invention applies to the case of variations of the features of a text fragment presented as a string of numbers (for example, “. . . 428”). Examples of the corresponding features are the following: a feature of a phone number, or a feature of the price of a purchased item or items, or a feature of the address (a number of the building, a portion of the zip code and the like).

The features can also be in the form of a regular expression. An example of such a feature is <<Text fragment satisfies regular expression for date format>> (for instance, MM/DD/YYYY).
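A regular-expression feature of this kind can be sketched with Python's standard `re` module. The pattern below is one possible expression for the MM/DD/YYYY format and is an assumption, not the expression used by the invention; it checks the ranges of the month and day but not calendar validity, and the function name `satisfies_date_format` is hypothetical.

```python
import re

# Month 01-12, day 01-31, four-digit year, separated by slashes.
DATE_PATTERN = re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}")


def satisfies_date_format(fragment):
    """Feature: the whole text fragment matches the MM/DD/YYYY pattern."""
    return DATE_PATTERN.fullmatch(fragment) is not None


is_date = satisfies_date_format("09/07/2015")       # True
bad_month = satisfies_date_format("13/07/2015")     # False
```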

The features can also be in the form of frequent key words in documents such as sales receipts: the combination “tel” corresponding to a phone number, the combination “total” corresponding to the total purchase price, and so on.

Similarly, we can use keywords such as “thank you for coming to” to find the vendor's name field (corresponding feature is <<Text fragment is going after phrase “thank you for coming to”>>). Similarly, the keyword <<address>> may be used to find the address field in a sales receipt or on a business card (corresponding feature is <<Text fragment is going after address label>> (such as “address”)). The keyword <<company>> may be used to find the company name field on a business card (corresponding feature is <<Text fragment locates after company label>> (such as “company”)).
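A label-based feature such as <<Text fragment is going after address label>> can be sketched as a check of the preceding fragment in the linear order. This is a simplified illustration, assuming single-word labels and a flat list of fragments; the function name `follows_label` is hypothetical.

```python
def follows_label(fragments, index, labels):
    """Feature: the fragment at `index` directly follows one of the labels."""
    if index == 0:
        return False                          # nothing precedes the first fragment
    previous = fragments[index - 1].lower().rstrip(":")
    return previous in labels


fragments = ["Address:", "12", "Main", "St."]
after_label = follows_label(fragments, 1, {"address"})    # True
not_after = follows_label(fragments, 2, {"address"})      # False
```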

The method and system of the present invention also contemplate using binary features in identifying desired fields in the image of the document. Extracting various entities from a document, such as a sales receipt, occurs automatically by training the system on recognizing binary features (i.e. by classification). For example, the binary features for the entity corresponding to the field with the name of a vendor can be: the proximity of an address field, the presence of quotation marks, the nearby presence of words such as “Inc.”, “LLC”.

The keywords may be a part of a field. Examples of such features are <<Text fragment has street keyword in it>> (such as “St.”, “avenue”, “drive”), <<Text fragment has occupation words>> (such as “agent”, “broker”, “programmer”), or <<Text fragment has city name in it>>.

To find an internet address on a business card the following features can be used: <<Text fragment is located after <<url>> label>> (such as “web:”, “url:”), <<Text fragment comprises symbols “www”>> (maybe not exactly “www”, but something similar, such as “11w”), <<Text fragment includes domain name>> (such as “com”, “net”). Besides the listed examples of text fragments' features, procedures can also compute the features of links (edges) between the text fragments. Examples of such features are: <<The edge is between two compatible text fragments>> (such as several columns), <<The edge is between similar horizontal lines>>, <<The edge is between words>>, <<The edge is over a punctuation mark>>, <<The edge is derived by a finder>>.

In the cases when only some of the fields need to be identified and extracted (for example, when only the fields corresponding to the name of a vendor and the total purchase price are needed), the method of cascade classification invokes only those of its procedures which are necessary for the correct identification of the desired fields.

An example of the identification of a vendor's phone number in a sales receipt is now considered as an illustration of the described method. In order to quickly identify the phone number, regular expressions describing the possible symbols (digits in this case) and the order in which they are commonly written in phone numbers can be used.

If all the analyzed text fragments of the document were successfully classified, meaning that no unidentified fragments are left in the document, or that the cascade classification was stopped based on a decision, then the identification process can be completed. The identification can be completed if no conflict occurred during the classification, i.e. there were no occurrences of matching one text fragment with several conflicting classes (types), such as “corresponds to a phone number” and “does not correspond to a phone number”.

If conflicts have arisen during the classification process, other methods of classification can be utilized. Such other methods are classification by keywords, by geometry of the fields in a document, and so on. For example, phone and fax numbers are easily mixed up, but the classification procedure by a keyword pointing only to a phone number or a fax number can be effective in identifying the desired number correctly.

It is contemplated by the present invention that the inventive method can use an electronic dictionary (a dictionary), but in certain implementations of the method a dictionary is not needed. The following example is helpful in illustrating the embodiment when a dictionary is used. When extracting data from the fields in images of semi-structured documents using the method described above, many search procedures use specialized dictionaries. For example, to identify the “name” field, a search procedure uses a specialized dictionary of names; to identify the “address” field, a search procedure uses specialized dictionaries of streets and cities; to identify the “occupation” field, a search procedure uses a specialized dictionary of professions, positions, and occupations; to identify fields that are usually accompanied by a keyword, such as Tel, Fax, ph, T, F, Total, etc., a search procedure that uses a specialized dictionary of keywords is involved. Words found in a specialized dictionary can be different for different languages and/or countries. Dictionary words are used as features of the fields.

Because of the OCR errors in the text received as a result of document image recognition, it can be impossible to find an incorrectly recognized word/keyword in a dictionary. That is especially problematic with short CJK words. If a dictionary is used by a search procedure to identify a field, and there is an OCR error in the word, then this word will not be found in the dictionary. This means that the corresponding feature may not be computed properly utilizing the specialized dictionary and the field may not be identified correctly.

The above-described problem of dictionary use is solved in the present invention by using a reduced alphabet technique. The essence of the reduced alphabet technique is the following. Let us suppose there is an initial alphabet A. Hereinafter an alphabet is a generic term for a set of characters that may include numbers, letters, punctuation marks, and/or mathematical and special symbols. In forming our reduced alphabet, first we group together characters from the alphabet A if these characters can be easily confused for one another. From each of these groups we choose a meta-representative character. The set of meta-representative characters of the groups forms a new alphabet B, where the set of characters in alphabet B is a subset of the characters of alphabet A. During the construction of alphabet B from alphabet A, we map the set of characters of alphabet A into the set of characters of alphabet B (f: A → B). Now we search for a word in a specialized dictionary using alphabet B.

Combining of symbols into groups can be implemented based on OCR errorstatistics, Namely, if there is a reference text and a recognizedversion of the same text, we can identify the characters that wererecognized incorrectly, i.e. mixed up during the recognition processwith high frequency. For example, on the basis of this data, thefollowing characters may be grouped as easily confused for one another:“YyUu”, “AaAa”, “HhHH”, “OoOo0”, “Uu

” etc. from English and Russian alphabets. Such characters that can beeasily confused for one another are grouped together.

For each such group a meta representative character is either selectedor created. For example, a group of characters “YyYy” can be representedby one character “Y”.
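The mapping from the initial alphabet to the reduced alphabet can be sketched as a character lookup table. The groups below are hypothetical and chosen only for illustration (in practice they would come from OCR error statistics); the first character of each group is taken as its meta representative character, and characters outside every group map to themselves.

```python
# Hypothetical confusion groups; the first character is the representative.
GROUPS = ["Ss5", "Hh", "Ee", "IiLl1|", "Aa", "Oo0", "Tt"]

# f: A -> B, built by sending every character of a group to its representative.
TO_REDUCED = {ch: group[0] for group in GROUPS for ch in group}


def reduce_word(word):
    """Rewrite a word in the reduced alphabet, character by character."""
    return "".join(TO_REDUCED.get(ch, ch) for ch in word)


reduced_correct = reduce_word("Sheila")    # "SHEIIA"
reduced_ocr = reduce_word("She1la")        # "SHEIIA" despite the OCR error
```

A correctly recognized word and a version with a confused character reduce to the same string, so a dictionary lookup in the reduced alphabet still succeeds.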

Each character from the initial alphabet covered by UTF-32 encoding is represented by 4 bytes. A character from the reduced alphabet may be represented by a smaller number of bytes. As a result, replacing the characters from the initial alphabet by the characters from the reduced alphabet may substantially reduce the size of the storage space necessary for storing the dictionary.

The meta representative characters of different groups form a reduced alphabet. As a result, the reduced alphabet consists of the characters/symbols/letters which either cannot be confused for one another, or which are hard to confuse with any other character. The results of the recognition process of a semi-structured document and the words in a specialized dictionary can be represented using this reduced alphabet. By replacing letters in the specialized dictionary with corresponding meta representative characters we convert the specialized dictionary into a reduced alphabet dictionary.

FIG. 7 illustrates an example of a word represented in a specialized dictionary using the technique of the reduced alphabet. As shown in FIG. 7, a business card 700 contains a phone number field, and this field is accompanied by the <<Tel>> keyword 701. During the recognition process <<Tel>> can be recognized with different errors. FIG. 7 lists some of the variants of recognition of the word “Tel” with errors 702. Based on these examples and a comparison of different versions of recognition, character groups can be identified (704, 706 and 708). In these groups, meta representative characters are identified. In FIG. 7 the identified meta representative characters (705, 707, 709) are marked in bold. For groups 704, 706 and 708 they are <<T>>, <<E>>, and <<L>> respectively. In the reduced alphabet dictionary 710 the word “Tel” will be represented with the help of these meta representative characters as the word <<TEL>>.

The reduced alphabet dictionary not only provides correct identification of the document fields in spite of possible OCR errors in words/keywords, but can also help to correct these errors. For this purpose, the reduced alphabet dictionary has a special structure. A structural unit of the reduced alphabet dictionary is a word in the reduced alphabet. In one embodiment the words in the reduced alphabet dictionary are stored with their related source word(s) and, for each source word, with the associated set of versions of recognition of that source word with OCR errors. If the word in the reduced alphabet is associated with only one source word and, accordingly, with only one set of versions of recognition of the source word with OCR errors, then the incorrect spelling of the source word may be corrected by using the reduced alphabet dictionary.

FIG. 8 illustrates a more complicated case where one word in the reduced alphabet corresponds to two source words. In FIG. 8 the two source words are Sheila (800) and Shelia (802). Using the reduced alphabet, the words Sheila and Shelia (800 and 802) may both be represented as the word SHEIIA (804), because the sets of versions of recognition of these two source words intersect (806 and 808). In this case, the incorrect spelling of the source word may be corrected by using the reduced alphabet dictionary if the erroneous version of recognition does not fall in the intersection of the two sets (810).
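The ambiguous case can be sketched as a lookup over the variant sets of both source words. The variant sets below are invented for illustration and are not the sets shown in FIG. 8; `CANDIDATES` and `resolve` are hypothetical names.

```python
# Hypothetical variant sets for the two source words (not the sets in FIG. 8).
# "Sheiia" and "Shella" are assumed to lie in the intersection of both sets.
CANDIDATES = {
    "Sheila": {"Shei1a", "Sheiia", "Shella"},
    "Shelia": {"She1ia", "Sheiia", "Shella"},
}


def resolve(recognized, candidates):
    """Return the single source word whose variant set contains the recognized
    string, or None when it lies in the intersection or in no set at all."""
    matches = [
        source
        for source, variants in candidates.items()
        if recognized == source or recognized in variants
    ]
    return matches[0] if len(matches) == 1 else None
```

With these assumed sets, `resolve("Shei1a", CANDIDATES)` identifies Sheila, while `resolve("Sheiia", CANDIDATES)` returns `None` because that variant belongs to both sets and cannot be corrected unambiguously.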

If the alphabet consists of hieroglyphs, such an alphabet can also be divided into several groups, and the result of recognizing a hieroglyph will be the group containing that hieroglyph rather than the code of that hieroglyph. As a result, the words are entered into the dictionary using a certain narrowed alphabet.

Thus, even if a character in a keyword has been recognized incorrectly, the corresponding keyword will still be found in the dictionary, associated with the keyword field, and will be correctly identified.

FIG. 6 shows exemplary hardware for implementing the techniques and systems described herein, in accordance with one implementation of the present disclosure. Referring to FIG. 6, the exemplary hardware includes at least one processor 602 coupled to a memory 604. The processor 602 may represent one or more processors (e.g., microprocessors), and the memory 604 may represent random access memory (RAM) devices comprising a main storage of the hardware, as well as any supplemental levels of memory, e.g., cache memories, non-volatile or back-up memories (e.g., programmable or flash memories), read-only memories, etc. In addition, the memory 604 may be considered to include memory storage physically located elsewhere in the hardware, e.g., any cache memory in the processor 602 as well as any storage capacity used as a virtual memory, e.g., as stored on a mass storage device 610.

The hardware also typically receives a number of inputs and outputs for communicating information externally. For interface with a user or operator, the hardware may include one or more user input devices 606 (e.g., a keyboard, a mouse, imaging device, scanner, microphone) and one or more output devices 608 (e.g., a Liquid Crystal Display (LCD) panel, a sound playback device (speaker)). To embody the present invention, the hardware typically includes at least one screen device.

For additional storage, the hardware may also include one or more mass storage devices 610, e.g., a floppy or other removable disk drive, a hard disk drive, a Direct Access Storage Device (DASD), an optical drive (e.g., a Compact Disk (CD) drive, a Digital Versatile Disk (DVD) drive) and/or a tape drive, among others. Furthermore, the hardware may include an interface with one or more networks 612 (e.g., a local area network (LAN), a wide area network (WAN), a wireless network, and/or the Internet, among others) to permit the communication of information with other computers coupled to the networks. It should be appreciated that the hardware typically includes suitable analog and/or digital interfaces between the processor 602 and each of the components 604, 606, 608, and 612, as is well known in the art.

The hardware operates under the control of an operating system 614, and executes various computer software applications, components, programs, objects, modules, etc. to implement the techniques described above. Moreover, various applications, components, programs, objects, etc., collectively indicated by application software 616 in FIG. 6, may also execute on one or more processors in another computer coupled to the hardware via a network 612, e.g., in a distributed computing environment, whereby the processing required to implement the functions of a computer program may be allocated to multiple computers over a network.

In general, the routines executed to implement the embodiments of the invention may be implemented as part of an operating system or a specific application, component, program, object, module or sequence of instructions referred to as a “computer program.” A computer program typically comprises one or more instructions set at various times in various memory and storage devices in a computer, and that, when read and executed by one or more processors in a computer, cause the computer to perform operations necessary to execute elements involving the various aspects of the invention. Moreover, while the invention has been described in the context of fully functioning computers and computer systems, those skilled in the art will appreciate that the various embodiments of the invention are capable of being distributed as a program product in a variety of forms, and that the invention applies equally regardless of the particular type of computer-readable media used to actually effect the distribution. Examples of computer-readable media include but are not limited to recordable type media such as volatile and non-volatile memory devices, floppy and other removable disks, hard disk drives, optical disks (e.g., Compact Disk Read-Only Memory (CD-ROMs), Digital Versatile Disks (DVDs), flash memory, etc.), among others. Another type of distribution may be implemented as Internet downloads.

Aspects of the present disclosure have been described above with respect to techniques for machine interpretation of information in text documents. However, it has been contemplated that fragments of this disclosure may, alternatively or additionally, be implemented as separate program products or elements of other program products.

All statements reciting principles, aspects, and embodiments of the disclosure, and specific examples thereof, are intended to encompass both structural and functional equivalents of the disclosure.

It will be apparent to those skilled in the art that various modifications can be made in the devices, methods, and program products of the present disclosure without departing from the spirit or scope of the disclosure. Thus, it is intended that the present disclosure includes modifications that are within the scope thereof and equivalents.

1-19. (canceled)
 20. A method comprising: obtaining a text representation of an image of a document, wherein the text representation comprises one or more text fragments; constructing a graph comprising one or more features values of the one or more text fragments and one or more links between the one or more text fragments; generating one or more hypotheses about the one or more text fragments belonging to one or more fields in the image of the document using the one or more text fragments and the one or more links in the graph; generating one or more combinations of the one or more hypotheses; selecting a combination of the one or more combinations; and extracting data from one or more fields in the image of the document based on the selected combination.
 21. The method of claim 20, wherein the text representation of the image of the document is a result of an optical character recognition (OCR).
 22. The method of claim 20, wherein the graph further comprises nodes and edges, and wherein constructing the graph further comprises: matching at least one of the nodes of the graph to at least one of a word, a word combination, or a word fragment of the text representation of the image of the document; and connecting the nodes in a linear order by the edges.
 23. The method of claim 20, further comprising computing one or more features of the one or more text fragments and the one or more links using a cascade classification.
 24. The method of claim 23, wherein the generating the one or more hypotheses about the one or more text fragments belonging to the one or more fields in the image of the document is at least in part based on the computed one or more features of the one or more text fragments and the one or more links.
 25. The method of claim 20, wherein the selecting of the combination of the one or more combinations is based on a computed quality of the one or more combinations of the one or more hypotheses.
 26. The method of claim 20, wherein the selecting of the combination of the one or more combinations is based on comparing a first feature vector of a first combination of the one or more hypotheses with a second feature vector of a second combination of the one or more hypotheses.
 27. A system comprising: a memory; and a processor communicably coupled to the memory, the processor to: obtain a text representation of an image of a document, wherein the text representation comprises one or more text fragments; construct a graph comprising one or more features values of the one or more text fragments and one or more links between the one or more text fragments; generate one or more hypotheses about the one or more text fragments belonging to one or more fields in the image of the document using the one or more text fragments and the one or more links in the graph; generate one or more combinations of the one or more hypotheses; select a combination of the one or more combinations; and extract data from one or more fields in the image of the document based on the selected combination.
 28. The system of claim 27, wherein the text representation of the image of the document is a result of an optical character recognition (OCR).
 29. The system of claim 27, wherein the graph further comprises nodes and edges, and wherein, to construct the graph, the processor is further to: match at least one of the nodes of the graph to at least one of a word, a word combination, or a word fragment of the text representation of the image of the document; and connect the nodes in a linear order by the edges.
 30. The system of claim 27, wherein the processor is further to compute one or more features of the one or more text fragments and the one or more links using a cascade classification.
 31. The system of claim 30, wherein the processor is further to generate the one or more hypotheses about the one or more text fragments belonging to the one or more fields in the image of the document based at least in part on the computed one or more features of the one or more text fragments and the one or more links.
 32. The system of claim 27, wherein the processor is further to select the combination of the one or more combinations based at least in part on a computed quality of the one or more combinations of the one or more hypotheses.
 33. The system of claim 27, wherein the processor is further to select the combination of the one or more combinations based at least in part on a comparison of a first feature vector of a first combination of the one or more hypotheses and a second feature vector of a second combination of the one or more hypotheses.
 34. A non-transitory computer-readable medium having instructions encoded thereon which, when executed by a processor, cause the processor to: obtain a text representation of an image of a document, wherein the text representation comprises one or more text fragments; construct a graph comprising one or more features values of the one or more text fragments and one or more links between the one or more text fragments; generate one or more hypotheses about the one or more text fragments belonging to one or more fields in the image of the document using the one or more text fragments and the one or more links in the graph; generate one or more combinations of the one or more hypotheses; select a combination of the one or more combinations; and extract data from one or more fields in the image of the document based on the selected combination.
 35. The non-transitory computer-readable medium of claim 34, wherein the text representation of the image of the document is a result of an optical character recognition (OCR).
 36. The non-transitory computer-readable medium of claim 34, wherein the graph further comprises nodes and edges, and wherein, to construct the graph, the processor is further to: match at least one of the nodes of the graph to at least one of a word, a word combination, or a word fragment of the text representation of the image of the document; and connect the nodes in a linear order by the edges.
 37. The non-transitory computer-readable medium of claim 34, wherein the processor is further to compute one or more features of the one or more text fragments and the one or more links using a cascade classification.
 38. The non-transitory computer-readable medium of claim 37, wherein the processor is further to generate the one or more hypotheses about the one or more text fragments belonging to the one or more fields in the image of the document based at least in part on the computed one or more features of the one or more text fragments and the one or more links.
 39. The non-transitory computer-readable medium of claim 34, wherein the processor is further to select the combination of the one or more combinations based at least in part on a computed quality of the one or more combinations of the one or more hypotheses.